Initially, we plotted the blotted weights against the body weights, and we noticed that the body weights appeared to separate into two groups.

But after color coding the chart by protocol (A, B, C, D), we noticed that the A and B protocols grouped together and the C and D protocols grouped together. In class, we were told that two experimental models were used, one for juvenile female rats and another for adult female ovariectomized rats, and that each experimental model was associated with two protocols. Based on this, we suspect that the gap in observed body weights is due to protocols A and B consisting of juvenile (i.e., smaller) rats.
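A minimal sketch of this color-coded scatterplot, using simulated stand-in data (the real data are not reproduced here; the column names `body`, `blot`, and `proto` are assumed from the rest of the report):

```r
library(ggplot2)

# Simulated stand-in for the real data: juvenile rats (protocols A, B) are
# lighter than adult ovariectomized rats (protocols C, D)
set.seed(1)
dat_sim <- data.frame(
  proto = rep(c("A", "B", "C", "D"), each = 50),
  body  = c(rnorm(100, mean = 55, sd = 5),    # juvenile
            rnorm(100, mean = 230, sd = 20))  # adult
)
dat_sim$blot <- exp(rnorm(200, mean = 4 + 0.002 * dat_sim$body, sd = 0.3))

# Coloring by protocol reveals that the two body-weight clusters
# line up with the A/B vs C/D split
p <- ggplot(dat_sim, aes(x = body, y = blot, color = proto)) +
  geom_point(alpha = 0.6) +
  labs(x = "Body Weight", y = "Blotted Weight", color = "Protocol")
```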

We then created boxplots of the body and blotted weight observations for each lab. The data appeared to be right skewed, so we took the log of the response variable and replotted. The resulting boxplots appear closer to normally distributed, and all of the weights vary by lab. It is also worth noting that body weight and uterus weight are not measured on the same scale, and we were not given the units for either, so the relative sizes of the boxplots are not directly comparable.
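The log-transformed boxplots described above can be sketched as follows (again on simulated stand-in data; the real data set is assumed to have `lab` and `blot` columns):

```r
library(ggplot2)

# Simulated stand-in: five labs with different baseline levels and a
# right-skewed (lognormal) response
set.seed(2)
dat_sim <- data.frame(
  lab  = rep(paste0("lab", 1:5), each = 40),
  blot = exp(rnorm(200,
                   mean = rep(seq(3.8, 4.6, length.out = 5), each = 40),
                   sd = 0.5))
)

# Logging the skewed response makes the per-lab boxplots more symmetric
p <- ggplot(dat_sim, aes(x = lab, y = log(blot))) +
  geom_boxplot() +
  labs(x = "Lab", y = "log(Blotted Weight)")
```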

Next, we plotted the body and blotted weight observations according to the protocol that was applied. Again, we logged each of the weight values to account for data skew.

Upon plotting all combinations of dosages across groups (each group represents a different combination of dose 1 and dose 2), we notice a couple of trends. First, we see that as the quantity of dose 1 increases without dose 2, the log blotted weight increases. Secondly, when dose 2 is introduced, an increase in quantity of dose 2 leads to a decrease in log blotted weight.

# Tabulate each dosage group's (dose1, dose2) combination and the number of
# observations from each protocol
dosage_groups <- dat %>%
  group_by(group) %>%
  summarize(dose1 = unique(dose1),
            dose2 = unique(dose2),
            A = sum(proto == "A"),
            B = sum(proto == "B"),
            C = sum(proto == "C"),
            D = sum(proto == "D")) %>%
  as.data.frame()

# Axis labels of the form "(dose1, dose2)"
doselabels <- apply(dosage_groups[, c("dose1", "dose2")], 1, paste, collapse = ", ") %>%
  paste("(", ., ")", sep = "")

ggplot(data = dat,
       mapping = aes(x = as.factor(group), y = log(blot))) + 
  geom_boxplot() + 
  labs(title = "log(Blotted Weight) by Dosage",
       x = "Combination of (Dose 1, Dose 2)",
       y = "log(Blotted Weight)") +
  scale_x_discrete(labels = doselabels) + 
  theme(axis.text.x = element_text(angle = 45, hjust=1),
        plot.title = element_text(hjust = 0.5))

kable(dosage_groups)
| group | dose1 | dose2 |  A |  B |  C |  D |
|------:|------:|------:|---:|---:|---:|---:|
|     1 |  0.00 |   0.0 | 89 | 72 | 54 | 24 |
|     2 |  0.00 |   0.0 | 95 | 72 | 54 | 24 |
|     3 |  0.01 |   0.0 | 84 | 72 | 54 | 24 |
|     4 |  0.03 |   0.0 | 90 | 72 | 54 | 23 |
|     5 |  0.10 |   0.0 | 96 | 72 | 54 | 24 |
|     6 |  0.30 |   0.0 | 96 | 72 | 54 | 24 |
|     7 |  1.00 |   0.0 | 96 | 72 | 54 | 24 |
|     8 |  3.00 |   0.0 | 96 | 72 | 54 | 24 |
|     9 | 10.00 |   0.0 | 96 | 71 | 54 | 24 |
|    10 |  3.00 |   0.1 | 96 | 72 | 53 | 24 |
|    11 |  3.00 |   1.0 | 96 | 72 | 54 | 24 |

Initial Model-Fitting Results

First Approach: Univariate Normal

\[ \log(y) \sim \text{Norm}(\mu , \sigma^2) \]

This naive approach assumes the blotted weight follows a normal distribution with mean equal to the sample mean of the log blotted weight (about 4.26) and constant variance. The diagnostic plots below show that the same value is predicted for every input, that the quantiles of the data do not follow a normal distribution, and that the residuals are bimodal. We believe the bimodality is caused by the separation of rats into the juvenile and adult protocols.
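An intercept-only model like this can be fit and diagnosed in a few lines of base R. The sketch below uses a simulated bimodal stand-in for the data (the real data frame and `blot` column are assumed elsewhere in the report):

```r
# Simulated stand-in: two subpopulations, as with the juvenile/adult split
set.seed(3)
log_blot <- c(rnorm(100, mean = 3.6, sd = 0.3),
              rnorm(100, mean = 4.9, sd = 0.3))
dat_sim <- data.frame(blot = exp(log_blot))

# Intercept-only model: every observation gets the same fitted value,
# namely the sample mean of log(blot)
mod1 <- lm(log(blot) ~ 1, data = dat_sim)
all.equal(unname(coef(mod1)), mean(log(dat_sim$blot)))  # TRUE

# Diagnostics: hist(resid(mod1)) shows the bimodality;
# qqnorm(resid(mod1)); qqline(resid(mod1)) shows the departure from normality
```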

Second Approach: Multivariate Normal

\[ y_i, \mu_i \in M_{n_i \times 1}( \Re) \\ \Sigma_i \in M_{n_i \times n_i}( \Re) \\ \log(y_i) \sim \text{MVNorm}(\mu_i, \Sigma_i) \]

In this model, \(y_i\) is a vector of \(n_i\) observations from lab \(i\). It assumes that the log blotted weights of each lab follow an approximately normal distribution with its own mean.

We provide the diagnostic plots for the first lab, Basf, and place the rest in Appendix 1.1, since R's built-in plotting does not directly support multivariate regression models.

The diagnostic plots for this model indicate that the model predicts the same fitted value for each input within a lab, and that the residual quantiles do not follow a normal distribution. For the first lab, the histogram of residuals is not bimodal as it is in the simpler \(mod1\), but it does show left skew. These same problems, along with multimodality of the residuals, appear in the diagnostic plots for the remaining labs in the Appendix.
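When the observations within each lab are treated as independent (a diagonal covariance), the lab-specific mean estimates for this model coincide with those from a cell-means regression. A sketch on simulated stand-in data (lab names other than Basf are made up):

```r
# Simulated stand-in: three labs with different mean log weights
set.seed(4)
dat_sim <- data.frame(
  lab = rep(c("Basf", "labB", "labC"), times = c(50, 60, 40))
)
true_means <- c(Basf = 4.5, labB = 3.9, labC = 4.2)
dat_sim$blot <- exp(rnorm(150, mean = true_means[dat_sim$lab], sd = 0.4))

# Cell-means coding: one mean per lab, no global intercept
mod2 <- lm(log(blot) ~ lab - 1, data = dat_sim)

# Each lab's fitted mean equals its sample mean of log(blot)
lab_means <- tapply(log(dat_sim$blot), dat_sim$lab, mean)
all.equal(unname(coef(mod2)), unname(lab_means))
```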

A weakness of this model is that, as you add more structure (blocking factors, covariates for dosage, etc.) to the model, the covariance matrix becomes more complicated, and maximum likelihood estimation becomes unwieldy. The poor fit of this model indicates that we should control for more than just the lab effect, so we consider a mixed effects model.

Third Approach: Mixed Effects

\[ y_i \in M_{n_i \times 1}( \Re) \\ \mu_i \sim \text{Norm}(\mu, \psi^2) \\ \sigma_i^2 \sim \text{IG}(\nu/2, \nu\sigma^2/2) \\ y_{ij} \sim \text{Norm}(\mu_i, \sigma^2_i) \]

Let \(y_i\) be a \(n_i \times 1\) vector of observations, where \(i\) indicates the lab and \(n_i\) is equal to the number of observations within lab \(i\). So, \(y_{ij}\) represents an individual observation for subject \(j\) within lab \(i\). We apply a Gaussian assumption here and say that each observation is normally distributed for some \(\mu_i\) and \(\sigma^2_i\).

We then add a random effect on \(\mu_i\), so \(\text{Norm}(\mu, \psi^2)\) gives us a population distribution characterizing lab-to-lab variability. This allows our model to generalize to future labs. Furthermore, when estimating each \(\mu_i\), there is less variability because the estimates are pulled in toward the grand mean \(\mu\). This introduces bias, but the decrease in variance results in an overall reduction in MSE.
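This shrinkage toward the grand mean can be illustrated directly. The sketch below uses the BLUP weight \(w_i = \psi^2 / (\psi^2 + \sigma^2 / n_i)\) with variance components and lab means that are made up for illustration (of the same order as the fitted values reported in the Analysis section):

```r
# Shrinkage of lab means toward the grand mean, with assumed (not fitted)
# variance components for illustration
psi2   <- 0.10   # between-lab variance psi^2
sigma2 <- 0.44   # within-lab variance sigma^2
grand  <- 4.0    # grand mean mu

n_i  <- c(5, 20, 80)       # labs with few vs many observations
ybar <- c(3.2, 4.1, 4.6)   # hypothetical raw lab means

# BLUP weight: large labs keep their own mean, small labs get pulled in
w      <- psi2 / (psi2 + sigma2 / n_i)
shrunk <- w * ybar + (1 - w) * grand

# Every shrunken estimate is strictly closer to the grand mean
abs(shrunk - grand) < abs(ybar - grand)  # TRUE TRUE TRUE
```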

Analysis
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: log(blot) ~ dose1 + dose2 + (1 | lab)
##    Data: dat_train
## 
##      AIC      BIC   logLik deviance df.resid 
##   4134.9   4162.9  -2062.4   4124.9     2003 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.15212 -0.85658  0.02767  0.68697  2.79594 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  lab      (Intercept) 0.1065   0.3264  
##  Residual             0.4434   0.6659  
## Number of obs: 2008, groups:  lab, 19
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept)  3.947951   0.077602   50.87
## dose1        0.140596   0.005203   27.02
## dose2       -0.517096   0.051522  -10.04
## 
## Correlation of Fixed Effects:
##       (Intr) dose1 
## dose1 -0.121       
## dose2 -0.051 -0.133
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: log(blot) ~ dose1 + dose2 + proto + (1 | lab)
##    Data: dat_train
## 
##      AIC      BIC   logLik deviance df.resid 
##   2465.0   2509.8  -1224.5   2449.0     2000 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.6148 -0.7215 -0.3028  0.7392  2.7750 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  lab      (Intercept) 0.02841  0.1686  
##  Residual             0.19331  0.4397  
## Number of obs: 2008, groups:  lab, 19
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept)  3.631080   0.042728   84.98
## dose1        0.142003   0.003435   41.34
## dose2       -0.519480   0.034020  -15.27
## protoB       0.037860   0.027433    1.38
## protoC       1.257924   0.030492   41.25
## protoD       1.257420   0.039567   31.78
## 
## Correlation of Fixed Effects:
##        (Intr) dose1  dose2  protoB protoC
## dose1  -0.148                            
## dose2  -0.059 -0.133                     
## protoB -0.251  0.009 -0.010              
## protoC -0.234  0.010 -0.002  0.467       
## protoD -0.159  0.009 -0.010  0.342  0.366

In our analysis, we choose to add a random effect for the lab variable in order to account for lab-to-lab heterogeneity. Essentially, this allows us to avoid violating an independence assumption by allowing a different baseline response for each lab. First, we model these differences between individual labs with a random intercept for each lab. We include only dose1 and dose2 as fixed effects, then introduce proto as a fixed effect after confirming through a likelihood ratio test (anova: \(\chi^2 = 1675.88\) on 3 degrees of freedom, \(p\) < 2.2e-16) that it is a significant predictor.
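The model-building sequence above can be sketched with lme4 as follows. The data here are a simulated stand-in for `dat_train` (coefficients chosen near the fitted values); both models are fit by maximum likelihood (`REML = FALSE`) so the likelihood ratio test is valid:

```r
library(lme4)

# Simulated stand-in for dat_train
set.seed(6)
n_lab <- 8; n_per <- 40
sim <- data.frame(
  lab   = factor(rep(seq_len(n_lab), each = n_per)),
  dose1 = runif(n_lab * n_per, 0, 10),
  dose2 = runif(n_lab * n_per, 0, 1),
  proto = factor(sample(c("A", "B", "C", "D"), n_lab * n_per, replace = TRUE))
)
lab_eff <- rnorm(n_lab, 0, 0.3)
sim$blot <- exp(3.6 + 0.14 * sim$dose1 - 0.52 * sim$dose2 +
                c(A = 0, B = 0.04, C = 1.26, D = 1.26)[as.character(sim$proto)] +
                lab_eff[sim$lab] + rnorm(nrow(sim), 0, 0.44))

# Random intercept models, with and without proto as a fixed effect
m2 <- lmer(log(blot) ~ dose1 + dose2 + (1 | lab), data = sim, REML = FALSE)
m3 <- lmer(log(blot) ~ dose1 + dose2 + proto + (1 | lab), data = sim, REML = FALSE)

# Likelihood ratio (chi-squared) test for the proto effect
lrt <- anova(m2, m3)
```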

With this full model, we see that the variability due to lab is 0.028 (in terms of variance), while the variability due to non-lab sources is 0.193. Looking at the fixed effects, an increase in the amount of dose1 corresponds to an increase (0.142 units) in the response variable, log(blot), while an increase in the amount of dose2 corresponds to a decrease (0.519 units). Protocols C and D lead to a substantial increase in the response relative to protocol A, while the estimated effect of protocol B is small (t = 1.38) and not clearly distinguishable from A. The diagnostic plots below show that this model fits better than the previous two: although the residual variance decreases as the fitted values increase, the quantiles of the residuals are approximately normal.

In our previous random intercept model, we account for baseline differences between labs, but we also assume that the effect of each dose is the same for every lab. After plotting log(blot) versus dose by lab, we see that this is not a valid assumption, since the lines are clearly not parallel. So we introduce a random slope model, in which dose1 and dose2 can have lab-specific slopes.
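The random slope specification adds the dose terms inside the random-effects part of the lmer formula. A sketch on simulated stand-in data with lab-specific slopes (the real data and fitted object `m4` are not reproduced here):

```r
library(lme4)

# Simulated stand-in: each lab has its own dose1 and dose2 slope
set.seed(7)
n_lab <- 10; n_per <- 50
sim <- data.frame(
  lab   = factor(rep(seq_len(n_lab), each = n_per)),
  dose1 = runif(n_lab * n_per, 0, 10),
  dose2 = runif(n_lab * n_per, 0, 1)
)
slope1 <- rnorm(n_lab, 0.14, 0.05)    # dose1 effect varies by lab
slope2 <- rnorm(n_lab, -0.52, 0.10)   # dose2 effect varies by lab
sim$blot <- exp(3.6 + slope1[sim$lab] * sim$dose1 +
                slope2[sim$lab] * sim$dose2 + rnorm(nrow(sim), 0, 0.3))

# Random intercept AND random dose slopes for each lab
m4 <- lmer(log(blot) ~ dose1 + dose2 + (1 + dose1 + dose2 | lab), data = sim)

# Per-lab slope estimates (fixed effect plus each lab's random deviation)
lab_coefs <- coef(m4)$lab[, c("dose1", "dose2")]
```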

In this random slope model (m4), we see that the variability due to sources other than the random effects is 0.079 (decreased from 0.193).

The diagnostic plots for this model are similar to those for the previous model with fixed intercepts.

We also evaluate the mixed models by how well they can predict data that they were not trained on. Using a hold-out sample of 25% of the data set, we calculated the results for Root Mean Squared Error (\(\text{RMSE} = \sqrt{E[(y - \hat{y})^2]}\)) and Mean Absolute Error (\(\text{MAE} = E[|y - \hat{y}|]\)). Root Mean Squared Error penalizes more for extreme error, while Mean Absolute Error simply averages all of the errors. The results are in the table below:

|             |      MAE |     RMSE |
|-------------|---------:|---------:|
| Random Dose | 36.41391 | 60.89582 |
| Fixed Dose  | 35.23408 | 57.98964 |

The models are similar in their predictive accuracy, although Fixed Dose has a marginally better RMSE and MAE for this testing data set.
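The two metrics can be computed with short base-R helpers; in the report, `yhat` would be the model's predictions on the hold-out set:

```r
# Hold-out evaluation metrics: y = observed response, yhat = predicted
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))
mae  <- function(y, yhat) mean(abs(y - yhat))

# Toy check: a constant error of 2 gives MAE = 2 and RMSE = 2,
# while RMSE > MAE whenever the errors are unequal
y <- c(1, 2, 3); yhat <- y + 2
mae(y, yhat)   # 2
rmse(y, yhat)  # 2
```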


If the fitted dose-response line for each lab looks similar across labs, the estimated dose response is probably reliable. If there is a large amount of lab-to-lab variance, it is probably not reliable, because the response depends heavily on which lab ran the experiment.

Contributions

Nathaniel Brown made the visualizations for this report. He also organized the relevant files in a Github repository for the group to access and edit. Annie Tang compiled the group work done on EDA into a .rmd and wrote the accompanying explanations for the EDA and approaches to analysis. William Yang helped pair on EDA analysis and identify approaches to handle the data. Approaches to analysis were a joint effort by all members of the group.

Appendix

1.1 Multivariate Normal Diagnostic Plots